Abstract

Changing base composition during the evolution of biological sequences can mislead 
some of the phylogenetic inference techniques in current use. However, detecting 
whether such a process has occurred may be difficult, since convergent evolution 
may lead to similar base frequencies emerging from different lineages. 

To study this situation, algebraic models of biological sequence evolution are intro- 
duced in which the base composition is fixed throughout evolution. Basic properties 
of the associated algebraic varieties are investigated, including the construction of 
some phylogenetic invariants.

1 Introduction



Models of molecular evolution, such as for DNA sequences, typically assume 
evolution occurs along a bifurcating tree, proceeding from a root representing 
the common ancestral sequence, toward the leaves representing the descendent
sequences. At each site in the sequence, bases mutate according to a
probabilistic process that depends upon the edge of the tree. Usually only the
sequences at the leaves of the tree can be observed, while sequences at internal 
nodes correspond to hidden variables in this graphical model. A fundamental 
problem of sequence-based phylogenetics is to infer the tree topology from 
observed sequences, assuming some reasonable model. 

In the works of Cavender and Felsenstein [CF87] and Lake [Lak87], the
connections between this problem and algebraic geometry first emerged in the
phylogenetics literature. Under many standard models of molecular evolution,
for a fixed tree topology the joint distribution of bases in the leaf sequences
is described by polynomial equations in the parameters of the model, thus
parameterizing a variety associated to the tree. The defining polynomials of
this variety, called phylogenetic invariants, are polynomials that vanish on any 
joint distribution arising from the tree and model, regardless of parameter val- 
ues. Finding phylogenetic invariants for various models has been of interest 
both for providing theoretical understanding, and in hopes that methods of 
phylogenetic inference that do not require parameter estimation may be de- 
veloped. See [Fel03]. 

For certain models, much progress has been made on determining invariants.
Key advances for group-based models, such as the Kimura three-parameter model,
were made in [ES93] and [SSEW93], which built on the Hadamard conjugation
introduced in [HP93]. Recently, Sturmfels and Sullivant [SS04] further
exploited the Hadamard conjugation to recognize that these varieties are toric,
completing the determination of all invariants in this case. For the general
Markov model, Allman and Rhodes [AR03] found new constructions of invariants,
though the complete determination of the ideal is still open.

In this paper, we consider models that lie between group-based models and the 
general Markov model. Specifically, we assume that a fixed vector describes 
the relative frequencies of the bases in sequences at every node of the tree, so 
that the base composition of sequences remains stable throughout evolution. 

Our motivation for this assumption is a biological one. Many of the models 
currently assumed in performing inference with real data make an assump- 
tion of a stable base composition (e.g., all group-based models, the general 
time-reversible model). However, there are data sets in which base composi- 
tion seems to have changed during evolution, as reflected in comparisons of 
the sequences at the leaves. Although the extent to which this issue is prob- 
lematic in real data sets is controversial, a number of authors have pointed 
out that changing base composition may mislead some methods of inference, 
especially if it results in convergent mutations in different parts of the tree. 
See [LAHP94,CL01,RK03] and their references. 

In [KG01] a 'disparity index' was introduced as a simple statistical test that
might indicate inhomogeneity of the mutation process along the different edges
of the tree. This index is based on a pairwise comparison of base compositions
of sequences at the leaves. It is, however, possible that all leaf sequences have 
the same base composition, while an internal node sequence has a different one. 
Indeed, this is exactly the issue with convergent mutations; base composition 
may appear to be the same in observed sequences, yet it differed in the common 
ancestral sequence. If a model is chosen only through comparing the base 
compositions of sequences at the leaves, it may be an inappropriate one. 

Better understanding the constraints placed on the joint distribution of bases 
in sequences from various taxa by an assumption of a stable base distribution is 
therefore desirable. To begin investigating this issue, in section 2 we introduce 
three models of molecular evolution that include an assumption of stable base 
distribution. When the number of bases in the model is k = 2, the three models
coincide, and their common structure allows us to give a more in-depth analysis
than for general k. This is the subject of section 3, where we give a rational 
map inverting the parameterization, and find the full ideal of phylogenetic 
invariants for the 3-taxon tree. In section 4, the case of general k is considered. 
Basic facts about the associated phylogenetic varieties, such as their dimension 
and irreducibility/non-irreducibility, are investigated. Although our knowledge 
of phylogenetic invariants is incomplete, we give constructions of some for these 
models. 



2 The Models 



Let T denote an undirected bifurcating tree, with n leaves labeled by the
taxa $a_1, a_2, \dots, a_n$. If r is some vertex in T, either internal or terminal, we
use $T_r$ to denote the tree rooted at r. We view $T_r$ as a directed graph, with
all edges directed away from r forming a set $\mathrm{Edge}(T_r)$. Thus $T_r$ represents a
hypothetical evolutionary history of the taxa in their descent from a common
ancestor at r. For simplicity, we refer to $T_r$ as a rooted n-taxon tree.

We model the evolution along $T_r$ of sequences composed from an alphabet
$[k] = \{1, 2, \dots, k\}$ of bases or states (e.g., k = 4 for DNA). A root
distribution vector $\mathbf{p}_r = (p_1, p_2, \dots, p_k)$, with $p_i \in [0,1]$,
$\sum_i p_i = 1$, describes the frequency of bases in an ancestral sequence. To
each $e \in \mathrm{Edge}(T_r)$ we associate a $k \times k$ Markov matrix $M_e$ (with
entries in [0,1], each row summing to 1) whose (i, j)-entry specifies the
conditional probability of base i at the initial vertex mutating to base j at
the final vertex of the edge. Together $\mathbf{p}_r$ and
$\{M_e\}_{e \in \mathrm{Edge}(T_r)}$ comprise the parameters of the model. If no
additional requirements are placed on $\mathbf{p}_r$ or the $M_e$, then we have
described the general Markov model (GM) of sequence evolution, studied in [AR03].

Letting $X_{GM,k,T_r}$ denote the parameter space for the k-state GM model on $T_r$,
we can view $X_{GM,k,T_r}$ as a subset of $[0,1]^M$ for $M = k + k^2 E$ with $E = 2n - 3$,
the number of edges of T. We have a map

$$\phi = \phi_{GM,k,T_r} : X_{GM,k,T_r} \to [0,1]^{k^n} \subset \mathbb{C}^{k^n},$$

so that $\phi(x)$ gives the joint distribution of bases in aligned sequences at the
leaves arising from the parameter choice x. Specifically,
$\phi(x) = P = (p_{j_1 j_2 \dots j_n})$, a $k \times \cdots \times k$ tensor with entries

$$p_{j_1 j_2 \dots j_n} = \sum_{(i_v) \in \mathcal{I}(j_1, j_2, \dots, j_n)} p_{i_r} \prod_{e = (v \to w) \in \mathrm{Edge}(T_r)} M_e(i_v, i_w),$$

where
$\mathcal{I}(j_1, j_2, \dots, j_n) = \{ (i_v) \mid v \in \mathrm{Vert}(T_r),\ i_v \in [k],\ i_{a_l} = j_l \text{ for } l = 1, \dots, n \} \subset [k]^{2n-2}$.
Note $\phi$ is a polynomial map, viewed as a function of the entries of
$\mathbf{p}_r$ and $M_e$, and extends to a polynomial map
$\mathbb{C}^M \to \mathbb{C}^{k^n}$ which we also denote by $\phi$.
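As a concrete illustration of this parameterization, the following minimal Python sketch evaluates the joint distribution in the simplest case, a 3-taxon tree rooted at its central node, where the sum over internal states runs only over the state at the root. The function name and the numerical parameter values are hypothetical, chosen only for illustration. Exact rational arithmetic is used so the entries can be checked to sum to 1.

```python
from fractions import Fraction as F
from itertools import product

def joint_distribution_star(p_root, M1, M2, M3):
    """Joint leaf distribution on a 3-taxon tree rooted at its central node.

    Entry (i, j, h) is the sum over root states r of
    p_root[r] * M1[r][i] * M2[r][j] * M3[r][h].
    """
    states = range(len(p_root))
    return {
        (i, j, h): sum(p_root[r] * M1[r][i] * M2[r][j] * M3[r][h] for r in states)
        for i, j, h in product(states, repeat=3)
    }

# Hypothetical 2-state GM parameters; each row of each matrix sums to 1.
p_root = [F(3, 10), F(7, 10)]
M1 = [[F(9, 10), F(1, 10)], [F(2, 10), F(8, 10)]]
M2 = [[F(4, 5), F(1, 5)], [F(1, 10), F(9, 10)]]
M3 = [[F(7, 10), F(3, 10)], [F(1, 2), F(1, 2)]]

P = joint_distribution_star(p_root, M1, M2, M3)
total = sum(P.values())  # a joint distribution must sum to 1
print(total, P[(0, 0, 0)])
```

Each tensor entry is visibly a polynomial in the entries of the root vector and the edge matrices, which is the property the algebraic treatment relies on.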

In this paper we are interested in submodels of the GM model, in which we 
have stable base frequencies at all vertices in the tree. We introduce three such 
models, defined by imposing additional restrictions on parameters of the GM 
model. After formally defining the models, we will motivate their assumptions 
and names. 

• Stable Base Distribution Model (SBD): 1) $\mathbf{p}_r$ has no zero entries, and
2) $\mathbf{p}_r$ is fixed by all $M_e$; that is, $\mathbf{p}_r M_e = \mathbf{p}_r$ for all edges e.

• Simultaneous Diagonalization Model (SD): In addition to the assumptions
of SBD, 3) with $D_r = \mathrm{diag}(\mathbf{p}_r)$, all matrices in

$$\{M_e \mid e \in \mathrm{Edge}(T_r)\} \cup \{D_r^{-1} M_e^T D_r \mid e \in \mathrm{Edge}(T_r)\}$$

commute with one another.

• Algebraic Time Reversible Model (ATR): In addition to the assumptions of
SD, 4) for all edges e, $M_e = D_r^{-1} M_e^T D_r$.
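The four numbered conditions can be tested mechanically for a given parameter choice. The sketch below is one possible way to do so in Python with exact rational arithmetic; the helper names (`is_sbd`, `is_sd`, `is_atr`, `reversed_matrix`) and the example matrix are hypothetical and not taken from the text. The example uses a uniform base distribution on 3 states with a non-symmetric circulant Markov matrix, which satisfies the SBD and SD conditions but not the ATR condition.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def diag(v):
    return [[v[i] if i == j else F(0) for j in range(len(v))]
            for i in range(len(v))]

def reversed_matrix(p, M):
    """D^{-1} M^T D with D = diag(p); assumes p has no zero entries."""
    return matmul(matmul(diag([1 / x for x in p]), transpose(M)), diag(p))

def is_sbd(p, Ms):
    """Conditions 1) and 2): p has no zero entries and p M = p for every M."""
    return all(x != 0 for x in p) and all(matmul([p], M)[0] == p for M in Ms)

def is_sd(p, Ms):
    """Condition 3): the matrices M and D^{-1} M^T D all commute."""
    if not is_sbd(p, Ms):
        return False
    family = Ms + [reversed_matrix(p, M) for M in Ms]
    return all(matmul(A, B) == matmul(B, A) for A in family for B in family)

def is_atr(p, Ms):
    """Condition 4): each M equals D^{-1} M^T D."""
    return is_sd(p, Ms) and all(M == reversed_matrix(p, M) for M in Ms)

# Hypothetical 3-state example: uniform p and a non-symmetric circulant matrix.
p = [F(1, 3)] * 3
C = [[F(1, 2), F(1, 3), F(1, 6)],
     [F(1, 6), F(1, 2), F(1, 3)],
     [F(1, 3), F(1, 6), F(1, 2)]]
print(is_sbd(p, [C]), is_sd(p, [C]), is_atr(p, [C]))
```

The checks follow the definitions literally, so they scale poorly with the number of edges, but they are convenient for verifying small hand-built examples.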

We will also need: 

Definition 1 For any model $\mathcal{M}$ formed from the k-state GM model by
imposing additional assumptions on the parameters, and for any rooted n-taxon tree
$T_r$, we let $X_{\mathcal{M},k,T_r}$ denote the parameter space of $\mathcal{M}$ on $T_r$. Then the
algebraic variety $V(\mathcal{M}, k, T_r)$ is the Zariski closure in $\mathbb{C}^{k^n}$ of
$\phi(X_{\mathcal{M},k,T_r})$.

We now expand upon the model definitions. First, assuming $\mathbf{p}_r$ has no zero
entries, the matrices $D_r^{-1} M_e^T D_r$ are also Markov matrices fixing $\mathbf{p}_r$. They
arise naturally as follows: Consider a 2-taxon tree consisting of a single edge
e from vertex r to vertex s, with model parameters $\mathbf{p}_r$ and $M_e$. Then the
joint distribution of bases in aligned sequences at r and s arising from these
parameter choices is given by the entries in $D_r M_e$. Assuming $M_e$ fixes $\mathbf{p}_r$, so
that $\mathbf{p}_r$ is also the base distribution of a sequence at s, then the identity

$$D_r M_e = \left(D_r \left(D_r^{-1} M_e^T D_r\right)\right)^T$$

shows that the model parameters $\mathbf{p}_r$, $M_e$ on the 1-edge tree rooted at r lead
to the same joint distribution as the model parameters $\mathbf{p}_r$, $D_r^{-1} M_e^T D_r$ on the
1-edge tree rooted at s. More generally, as shown in [SSH94,AR03] for the GM
model, parameters for the SBD, SD, or ATR model on $T_r$ produce the same
joint distribution as the set of parameters on $T_s$ for any other vertex s, simply
by defining $\mathbf{p}_s = \mathbf{p}_r$, and for those edges whose directions have reversed in
changing the root location, replacing $M_e$ by $D_r^{-1} M_e^T D_r$. In particular, we see

Proposition 2 $\phi(X_{\mathcal{M},k,T_r})$ and $V(\mathcal{M}, k, T_r)$ are independent of the choice of
r for $\mathcal{M}$ = SBD, SD, and ATR. Thus, for these models, $V(\mathcal{M}, k, T)$ is
well-defined without reference to r.

Second, the requirement for the SD and ATR model that the specified collec- 
tion of matrices commute is, in fact, equivalent to an assumption that those 
matrices are simultaneously diagonalizable. To see this, first note that the 
commutation assumption is equivalent to the commutation of the collection 

$$\{D_r^{1/2} M_e D_r^{-1/2} \mid e \in \mathrm{Edge}(T_r)\} \cup \{D_r^{-1/2} M_e^T D_r^{1/2} \mid e \in \mathrm{Edge}(T_r)\}.$$

But this implies in particular that each matrix $D_r^{1/2} M_e D_r^{-1/2}$ is normal, and
hence diagonalizable. Commutativity then implies the existence of simulta- 
neous eigenvectors for this collection, and hence for the original collection. 
Conversely, if the matrices are simultaneously diagonalizable, they certainly 
commute. 

Third, the ATR model is related to the general time-reversible model (GTR)
often used in phylogenetic studies. The GTR assumes that for each edge e,
$M_e = \exp(R t_e)$, where $t_e$ is a scalar parameter and R is a rate matrix (with
rows summing to 0) common to all edges with the properties that $D_r R$ is
symmetric and $\mathbf{p}_r R = \mathbf{0}$ ([Fel03]). A collection of Markov matrices arising from
GTR parameters thus satisfies the hypotheses of the ATR model. However,
the common rate matrix assumption of the GTR imposes a relationship among
the logarithms of the eigenvalues of the Markov matrices $M_e$ which the ATR
does not, and thus the ATR is more amenable to algebraic analysis.

Finally, we note that the group-based models, such as the Kimura 3-parameter
(K3ST), can be viewed as the ATR together with additional assumptions on
the eigenvectors of the $M_e$. For instance, K3ST requires the eigenvectors be
the columns of a 4 x 4 Hadamard matrix, with the uniform vector
(1/4, 1/4, 1/4, 1/4) the stable base distribution.

We summarize the relationships of the various algebraic models with the
inclusions

$$\text{Group-based} \subset \text{ATR} \subset \text{SD} \subset \text{SBD} \subset \text{GM}.$$



For $k \ge 3$, these inclusions are all strict, though for k = 2 the three central
models are identical, as shown below. Of course, the associated varieties are
related by a reversed chain of inclusions.



3 The 2-base model 

For k = 2, the SBD, SD, and ATR models are all the same. To see this, and
to fix notation for future use, consider the SBD model, with root distribution
vector $\mathbf{p}_r = (p, 1 - p) = (p, q)$. Since each matrix $M_e$ has left eigenvector
$\mathbf{p}_r$ and right eigenvector (1,1), both with eigenvalue 1, we readily find we can
express

$$M_e = M(m_e) = \begin{pmatrix} 1 - m_e q & m_e q \\ m_e p & 1 - m_e p \end{pmatrix},$$

thus associating a single scalar parameter $m_e$ to each edge. We also see that
$M_e$ satisfies the hypotheses of the ATR model as well. (In fact, for k = 2, the
ATR model and the GTR model also coincide.) The form of $M_e$ allows us to
identify parameters for an n-taxon tree with a point $(p; \{m_e\}) \in \mathbb{R}^{2n-2}$.
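The one-parameter form of a 2-state edge matrix can be checked numerically. The sketch below assumes the explicit matrix with rows (1 - m q, m q) and (m p, 1 - m p), which is the form forced by the stated eigenvectors together with a second eigenvalue 1 - m; the function name `M` mirrors the notation M(m) used later in the text, and the parameter values are hypothetical. It verifies the left eigenvector (p, q), the right eigenvector (-q, p), and the multiplication rule for two such matrices sharing the same p.

```python
from fractions import Fraction as F

def M(m, p):
    """2-state Markov matrix fixing (p, 1 - p), with second eigenvalue 1 - m."""
    q = 1 - p
    return [[1 - m * q, m * q], [m * p, 1 - m * p]]

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Hypothetical parameter values.
p, m1, m2 = F(3, 10), F(1, 5), F(1, 2)
q = 1 - p
A = M(m1, p)

left = matmul([[p, q]], A)[0]                      # (p, q) A
right = [row[0] * (-q) + row[1] * p for row in A]  # A applied to (-q, p)

# Two such matrices multiply to a third of the same form, with
# (1 - m1)(1 - m2) = 1 - m12.
m12 = m1 + m2 - m1 * m2
same_form = matmul(M(m1, p), M(m2, p)) == M(m12, p)
print(left, right, same_form)
```

Because the eigenvalue 1 - m multiplies along a path, products of edge matrices along a path in the tree stay within the same one-parameter family.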

We first consider a 3-taxon tree $T_r$, rooted at its central node, with three edges
$e_1$, $e_2$, and $e_3$ leading from r to leaves $a_1$, $a_2$, and $a_3$. Labeling the states 0
and 1, and using these as indices to refer to matrix entries corresponding to
the states, the joint distribution of bases at a site in sequences at the leaves
is now described by a $2 \times 2 \times 2$ tensor $P = (p_{ijk})$, where, writing
$M_l = M(m_l)$ for the matrix on edge $e_l$,

$$p_{ijk} = p\,M_1(0,i)\,M_2(0,j)\,M_3(0,k) + q\,M_1(1,i)\,M_2(1,j)\,M_3(1,k).$$

Viewing P as a polynomial function of $p, m_1, m_2, m_3$, we thus have a map
$\psi : \mathbb{C}^4 \to \mathbb{C}^8$, and readily see that the Zariski closure of the image of
$\psi$ is V(SBD, 2, T).

For notational ease, we follow the convention that replacing an index by the
symbol '+' indicates marginalization over that index. For instance,
$p_{ij+} = \sum_k p_{ijk}$, while $p_{i++} = \sum_{j,k} p_{ijk}$.

Proposition 3 The following rational map provides an explicit inverse to the
parameterization map $\psi$:

$$p = p_{0++},$$

$$m_l = 1 - \frac{\sum_{i,j,k} (-1)^{i+j+k}\, p_{ijk}\, p_{(1-i)++}\, p_{(1-j)++}\, p_{(1-k)++}}{(p_{1++} - p_{0++})\, d_l}, \quad l = 1, 2, 3,$$

where $d_1 = \det(p_{+ij})$, $d_2 = \det(p_{i+j})$, and $d_3 = \det(p_{ij+})$.

PROOF. Define a $2 \times 2 \times 2$ diagonal tensor D with D(0,0,0) = p, D(1,1,1) = q,
and all other entries zero. We then have

$$p_{ijk} = \sum_{l,m,n=0}^{1} D(l,m,n)\, M_1(l,i)\, M_2(m,j)\, M_3(n,k), \qquad (1)$$

expressing P as the result of an action of an element of $GL_2 \times GL_2 \times GL_2$
on D. Also observe that each matrix $M_i$ has as right eigenvectors (1,1) and
(-q, p), with eigenvalues 1 and $1 - m_i$, respectively. Thus multiplying the
tensor P, whose entries are polynomials in $p, m_1, m_2$, and $m_3$, by the vector
$v = (v_0, v_1) = (-q, p)$ along each of its indices, yields

$$g_0 = \sum_{i,j,k=0}^{1} p_{ijk}\, v_i v_j v_k \qquad (2)$$

$$\phantom{g_0} = \sum_{i,j,k=0}^{1} \sum_{l,m,n=0}^{1} D(l,m,n)\, M_1(l,i) v_i\, M_2(m,j) v_j\, M_3(n,k) v_k.$$

Interchanging summations, and using that v is an eigenvector of each of the
$M_i$ yields

$$g_0 = \sum_{l,m,n=0}^{1} (1 - m_1)(1 - m_2)(1 - m_3)\, D(l,m,n)\, v_l v_m v_n = (1 - m_1)(1 - m_2)(1 - m_3)\, pq(p - q). \qquad (3)$$

Multiplying similarly, with two copies of v and one of (1,1) yields

$$g_1 = \sum_{i,j,k=0}^{1} p_{ijk}\, v_j v_k = (1 - m_2)(1 - m_3)\, pq,$$

$$g_2 = \sum_{i,j,k=0}^{1} p_{ijk}\, v_i v_k = (1 - m_1)(1 - m_3)\, pq, \qquad (4)$$

$$g_3 = \sum_{i,j,k=0}^{1} p_{ijk}\, v_i v_j = (1 - m_1)(1 - m_2)\, pq.$$

Now since $p = p_{0++}$, if we express the entries of v as linear polynomials in
the $p_{ijk}$, we may view the $g_i$ as polynomials in the $p_{ijk}$ as well. Then from
equations (2) and (4) we see that $g_0$ is of degree 4, while $g_1$, $g_2$, and $g_3$ are
each of degree 3.

A calculation shows all four of these polynomials have a factor of
$s = \sum_{i,j,k=0}^{1} p_{ijk}$, which of course evaluates to 1 on V, so we may replace each
with $\tilde g_i = g_i / s$, if desired. We also note that explicit expressions for the
quadratic $\tilde g_i$ as ordinary matrix determinants can be given:

$$\tilde g_1 = \det(p_{+ij}), \quad \tilde g_2 = \det(p_{i+j}), \quad \tilde g_3 = \det(p_{ij+}).$$

Equations (3) and (4) now lead directly to formulas for the $m_i$,

$$m_i = 1 - \frac{\tilde g_0}{(v_0 + v_1)\, \tilde g_i}, \quad \text{for } i = 1, 2, 3,$$

which yield the stated map. □
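The inversion formulas of Proposition 3 can be sanity-checked on a numerical example. The sketch below builds the 2 x 2 x 2 tensor from hypothetical parameter values and then recovers them from the tensor entries alone, using exact rational arithmetic. The function names are illustrative, and the explicit form of the edge matrix M(m) is the one assumed from the stated eigenvector properties. The recovery divides by the difference of the two base frequencies and by the 2 x 2 determinants, so it assumes p differs from 1/2, p is neither 0 nor 1, and no edge parameter equals 1.

```python
from fractions import Fraction as F
from itertools import product

def M(m, p):
    """2-state edge matrix with stationary vector (p, 1 - p)."""
    q = 1 - p
    return [[1 - m * q, m * q], [m * p, 1 - m * p]]

def tensor(p, m1, m2, m3):
    """p_ijk = p M1(0,i) M2(0,j) M3(0,k) + q M1(1,i) M2(1,j) M3(1,k)."""
    q = 1 - p
    M1, M2, M3 = M(m1, p), M(m2, p), M(m3, p)
    return {(i, j, k): p * M1[0][i] * M2[0][j] * M3[0][k]
                       + q * M1[1][i] * M2[1][j] * M3[1][k]
            for i, j, k in product((0, 1), repeat=3)}

def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def recover(P):
    """Rational inverse of the parameterization on a 3-taxon tree."""
    B = (0, 1)
    first = [sum(P[i, j, k] for j in B for k in B) for i in B]       # p_{i++}
    num = sum((-1) ** (i + j + k) * P[i, j, k]
              * first[1 - i] * first[1 - j] * first[1 - k]
              for i, j, k in product(B, repeat=3))
    d1 = det2([[sum(P[i, a, b] for i in B) for b in B] for a in B])  # det(p_{+ij})
    d2 = det2([[sum(P[a, j, b] for j in B) for b in B] for a in B])  # det(p_{i+j})
    d3 = det2([[sum(P[a, b, k] for k in B) for b in B] for a in B])  # det(p_{ij+})
    gap = first[1] - first[0]                                        # p_{1++} - p_{0++}
    return first[0], tuple(1 - num / (gap * d) for d in (d1, d2, d3))

params = (F(3, 10), F(1, 5), F(1, 2), F(1, 4))   # hypothetical p, m1, m2, m3
P = tensor(*params)
p_rec, m_rec = recover(P)
print(p_rec, m_rec)
```

The excluded parameter values are exactly those where a denominator of the rational map vanishes, which is why the inverse is rational rather than polynomial.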



The explicit invertibility of the parameterization map for the 3-taxon tree 
readily extends to n-taxon trees. 

Theorem 4 Suppose $T_r$ is a rooted n-taxon tree with
$(p; \{m_e\}_{e \in \mathrm{Edge}(T_r)}) \in \mathbb{C}^{2n-2}$ defining $\mathbf{p}_r = (p, 1 - p)$, $M_e = M(m_e)$ and

$$P = (p_{i_1 i_2 \dots i_n}) = \phi(\mathbf{p}_r; \{M_e\}_{e \in \mathrm{Edge}(T_r)}).$$

Then the polynomial map $\psi : (p; \{m_e\}_{e \in \mathrm{Edge}(T_r)}) \mapsto P$ is inverted by a rational
map explicitly given by the formulas:

1) $p = p_{0+\dots+}$.

2) For each terminal edge $e_0$, assume without loss of generality that
$e_0 = (v \to a_1)$. Choose two other taxa $a_2$, $a_3$ such that the path from $a_2$ to $a_3$ in T
passes through v. Then

$$m_{e_0} = 1 - \frac{\sum_{i,j,k} (-1)^{i+j+k}\, p_{ijk+\dots+}\, p_{(1-i)+\dots+}\, p_{(1-j)+\dots+}\, p_{(1-k)+\dots+}}{(p_{1+\dots+} - p_{0+\dots+})\, \det(p_{+ij+\dots+})}.$$

3) For each internal edge $e = (v \to w)$, choose four taxa which, without loss
of generality, we assume are $a_1, a_2, a_3, a_4$, such that the path joining $a_1$ to $a_2$
in T passes through v, but not through w; and the path joining $a_3$ to $a_4$ passes
through w, but not v. Then

$$m_e = 1 - \frac{\left(\sum_{i,j,k} (-1)^{i+j+k}\, p_{ijk+\dots+}\, p_{(1-i)+\dots+}\, p_{(1-j)+\dots+}\, p_{(1-k)+\dots+}\right) \det(p_{+i+k+\dots+})}{\left(\sum_{i,j,k} (-1)^{i+j+k}\, p_{+ijk+\dots+}\, p_{(1-i)+\dots+}\, p_{(1-j)+\dots+}\, p_{(1-k)+\dots+}\right) \det(p_{ij+\dots+})}.$$

PROOF. The formula for p is clear. For the remaining formulas, note that
because we are dealing with the ATR model, the root location may be changed
without changing the variety, and while moving the root may change a
direction of an edge e, the matrix $M_e$ is unchanged.

For a terminal edge $e_0$ as described above, the 3-dimensional tensor
$(p_{ijk+\dots+}) = \phi_{T'_v}(p; M_1, M_2, M_3)$ for a 3-taxon tree $T'_v$, where
$M_i = \prod_{e \in \mathrm{Path}(v, a_i)} M_e$ with $\mathrm{Path}(v, a_i)$ the set of edges in the path joining v to
$a_i$. In particular, $M_1 = M_{e_0}$, so applying the formula of Proposition 3 for $m_1$
yields the desired formula.

Similarly, for an internal edge $e_0$ as described above, start with the 3-dimensional
tensor $(p_{ijk+\dots+}) = \phi_{T'_v}(p; M_1, M_2, M_3)$ for the 3-taxon tree $T'_v$. Then, since
$M(m)M(m') = M(m'')$ is equivalent to $(1 - m)(1 - m') = 1 - m''$, by applying
the formula of Proposition 3, we find

$$\prod_{e \in \mathrm{Path}(v, a_3)} (1 - m_e) = \frac{\sum_{i,j,k} (-1)^{i+j+k}\, p_{ijk+\dots+}\, p_{(1-i)+\dots+}\, p_{(1-j)+\dots+}\, p_{(1-k)+\dots+}}{(p_{1+\dots+} - p_{0+\dots+})\, \det(p_{ij+\dots+})}.$$

Likewise, considering the 3-dimensional tensor $(p_{+ijk+\dots+})$, we find

$$\prod_{e \in \mathrm{Path}(w, a_3)} (1 - m_e) = \frac{\sum_{i,j,k} (-1)^{i+j+k}\, p_{+ijk+\dots+}\, p_{(1-i)+\dots+}\, p_{(1-j)+\dots+}\, p_{(1-k)+\dots+}}{(p_{1+\dots+} - p_{0+\dots+})\, \det(p_{+i+k+\dots+})}.$$

Since $\prod_{e \in \mathrm{Path}(v, a_3)} (1 - m_e) = (1 - m_{e_0}) \prod_{e \in \mathrm{Path}(w, a_3)} (1 - m_e)$, this yields the
given formula. □



Now, to determine phylogenetic invariants for the SBD model with k = 2,
we first consider the 3-taxon tree T. We seek all polynomials in the $p_{ijk}$ that
vanish on $\psi(\mathbb{C}^4)$, and thus define V(SBD, 2, T).

As we are considering a submodel of GM, we obtain the stochastic invariant,
which defines V(GM, 2, T):

$$f_0 = 1 - p_{+++}.$$



Several other invariants for the stable base composition model are easily found.
The distribution of bases in a sequence at $a_i$ is given by the vector $\mathbf{p}_i$, where

$$\mathbf{p}_1 = (p_{i++}), \quad \mathbf{p}_2 = (p_{+i+}), \quad \mathbf{p}_3 = (p_{++i}).$$

Since each leaf sequence must have the same base composition for SBD, we
set $\mathbf{p}_1 - \mathbf{p}_2 = \mathbf{p}_1 - \mathbf{p}_3 = 0$, obtaining two linear invariants

$$f_1 = p_{010} + p_{011} - p_{100} - p_{101}, \quad f_2 = p_{001} + p_{011} - p_{100} - p_{110},$$

whose span includes that arising from $\mathbf{p}_2 - \mathbf{p}_3 = 0$. These are the invariants
underlying the disparity index of [KG01].

From equations (3) and (4) we can also see that

$$h = \tilde g_0^2\, v_0 v_1 + \tilde g_1 \tilde g_2 \tilde g_3\, (v_0 + v_1)^2 \qquad (5)$$

is an invariant of degree 8. However, while $h \notin (f_0, f_1, f_2)$, the ideal
$(f_0, f_1, f_2, h)$ is not the full ideal of invariants. Using Macaulay 2 [GS] to find
the kernel of the ring map associated to $\psi$ quickly yields a single invariant of
degree 6 with 258 terms, which together with $f_0$, $f_1$, and $f_2$ generates the full
ideal defining V(SBD, 2, T).

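In fact, this invariant can be explained through the hyperdeterminants of
[GKZ94]. For a $2 \times 2 \times 2$ tensor such as P, the hyperdeterminant is given
explicitly as

$$\mathrm{Det}(P) = \left(p_{000}^2 p_{111}^2 + p_{001}^2 p_{110}^2 + p_{010}^2 p_{101}^2 + p_{011}^2 p_{100}^2\right)$$

$$- 2\left(p_{000} p_{001} p_{110} p_{111} + p_{000} p_{010} p_{101} p_{111} + p_{000} p_{011} p_{100} p_{111} + p_{001} p_{010} p_{101} p_{110} + p_{001} p_{011} p_{110} p_{100} + p_{010} p_{011} p_{101} p_{100}\right)$$

$$+ 4\left(p_{000} p_{011} p_{101} p_{110} + p_{001} p_{010} p_{100} p_{111}\right).$$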

Now reasoning from equation (1) and using the invariance properties of Det(P)
under the $SL_2 \times SL_2 \times SL_2$ action, one finds that in terms of model parameters,

$$\mathrm{Det}(P) = p^2 q^2 (1 - m_1)^2 (1 - m_2)^2 (1 - m_3)^2.$$

Thus $\tilde g_1 \tilde g_2 \tilde g_3 - pq\,\mathrm{Det}(P) = 0$, and so $\tilde g_1 \tilde g_2 \tilde g_3 + v_1 v_0\,\mathrm{Det}(P)$,
viewed as a degree 6 polynomial in the $p_{ijk}$, is an invariant. Expressing this
explicitly in terms of the $p_{ijk}$, we have the invariant

$$f_3 = \det(p_{+ij}) \det(p_{i+j}) \det(p_{ij+}) - p_{0++}\, p_{1++}\, \mathrm{Det}(p_{ijk}).$$
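The degree 6 invariant built from the hyperdeterminant can be checked on a sample point of the model. The sketch below evaluates the 2 x 2 x 2 hyperdeterminant and the combination of the three marginal determinants with the hyperdeterminant on a tensor produced from hypothetical 2-base SBD parameters. The function names are illustrative, and the explicit edge matrix M(m) is the form assumed from its eigenvector properties. This is a numerical spot check at a single parameter point, not a proof that the polynomial lies in the ideal.

```python
from fractions import Fraction as F
from itertools import product

def M(m, p):
    """2-state edge matrix with stationary vector (p, 1 - p)."""
    q = 1 - p
    return [[1 - m * q, m * q], [m * p, 1 - m * p]]

def tensor(p, m1, m2, m3):
    """Joint distribution on the 3-taxon tree rooted at its central node."""
    q = 1 - p
    M1, M2, M3 = M(m1, p), M(m2, p), M(m3, p)
    return {(i, j, k): p * M1[0][i] * M2[0][j] * M3[0][k]
                       + q * M1[1][i] * M2[1][j] * M3[1][k]
            for i, j, k in product((0, 1), repeat=3)}

def hyperdet(P):
    """Hyperdeterminant of a 2 x 2 x 2 tensor, written out term by term."""
    a = lambda s: P[tuple(int(c) for c in s)]
    squares = sum((a(x) * a(y)) ** 2 for x, y in (
        ("000", "111"), ("001", "110"), ("010", "101"), ("011", "100")))
    pairs = sum(a(w) * a(x) * a(y) * a(z) for w, x, y, z in (
        ("000", "001", "110", "111"), ("000", "010", "101", "111"),
        ("000", "011", "100", "111"), ("001", "010", "101", "110"),
        ("001", "011", "110", "100"), ("010", "011", "101", "100")))
    quads = (a("000") * a("011") * a("101") * a("110")
             + a("001") * a("010") * a("100") * a("111"))
    return squares - 2 * pairs + 4 * quads

def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def f3(P):
    """Product of the three marginal determinants minus p_{0++} p_{1++} Det(P)."""
    B = (0, 1)
    d1 = det2([[sum(P[i, a, b] for i in B) for b in B] for a in B])
    d2 = det2([[sum(P[a, j, b] for j in B) for b in B] for a in B])
    d3 = det2([[sum(P[a, b, k] for k in B) for b in B] for a in B])
    first = [sum(P[i, j, k] for j in B for k in B) for i in B]
    return d1 * d2 * d3 - first[0] * first[1] * hyperdet(P)

p, m1, m2, m3 = F(3, 10), F(1, 5), F(1, 2), F(1, 4)   # hypothetical values
P = tensor(p, m1, m2, m3)
print(hyperdet(P), f3(P))
```

On this sample point the hyperdeterminant agrees with the product formula in the model parameters, and the invariant evaluates to zero, as the derivation predicts.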

A computation with Macaulay 2 now yields the following: 

Theorem 5 The ideal of phylogenetic invariants vanishing on V(SBD, 2, T)
for the 3-taxon tree is

$$(f_0, f_1, f_2, f_3).$$

We thank a reviewer for pointing out that the $2 \times 2 \times 2$ hyperdeterminant was
introduced into a phylogenetic context in [SJ04], where it is called the tangle.
That paper considers the 2-base GM model on a 3-taxon tree rooted along an
edge, and proposes the hyperdeterminant as a generalized 'distance.'

For an n-taxon tree, determining the full ideal of invariants for the 2-base SBD
model remains open. Of course, this model inherits the invariants of the GM
model, which have been conjectured in [PS04] to be generated by 'edge
invariants' arising from rank conditions on 2-dimensional flattenings of the tensor.
This issue for GM will be dealt with in [AR04]. Additional invariants for SBD
arise from applying the invariants of Theorem 5 to all 3-dimensional marginal- 
izations of the n-dimensional tensor P. One might suspect these generate the 
full ideal, but even for the 4-taxon tree we have been unable to confirm this 
computationally. 



4 The k-base models, arbitrary k

Proposition 6 Let T be an n-taxon tree, with $E = E(n) = 2n - 3$ the number
of its edges. Denoting the dimension of the variety $V(\mathcal{M}, k, T)$ by $d(\mathcal{M}, k, T)$,

$$d(\mathrm{SBD}, k, T) = (k - 1) + (k - 1)^2 E,$$

$$d(\mathrm{SD}, k, T) = d(\mathrm{ATR}, k, T) = \frac{k(k - 1)}{2} + (k - 1) E.$$

PROOF. For fixed k and T, choose a root r for T. For each model $\mathcal{M}$ we
consider here, the parameter space $X_{\mathcal{M},k,T_r} \subseteq X_{GM,k,T_r}$ is a semialgebraic
subset of $\mathbb{R}^M$ with $M = k + k^2 E$, as all our model assumptions are polynomial
equalities or inequalities placing restrictions on the entries of $\mathbf{p}_r$ and the $M_e$.

For each $V(\mathcal{M}, k, T_r)$, we will find a complex quasi-projective variety X of
dimension $d = d(\mathcal{M}, k, T)$ and a generically finite map $\psi : X \to \mathbb{C}^M$, such that
$X_{\mathcal{M},k,T_r} \subseteq \psi(X)$ and $\overline{\psi^{-1}(X_{\mathcal{M},k,T_r})} = X$. These conditions imply

$$\overline{\psi(X)} = \overline{X_{\mathcal{M},k,T_r}},$$

so applying the map $\phi = \phi_{GM,k,T_r} : \mathbb{C}^M \to \mathbb{C}^{k^n}$, we find

$$\overline{\phi(\psi(X))} = \overline{\phi(X_{\mathcal{M},k,T_r})} = V(\mathcal{M}, k, T).$$

Since the results of [AR03] show $\phi$ is generically finite, the general fiber
$(\phi \circ \psi)^{-1}(P)$ is of dimension zero. Using a standard result on the dimension of
fibers of regular maps (see [Har92], for instance), we conclude that the dimensions
of X and $V(\mathcal{M}, k, T)$ are the same.

We begin with the SBD model, so $d = (k - 1) + (k - 1)^2 E$. Let

$$X = \Big\{ x \in \mathbb{C}^d \ \Big|\ \sum_{i=1}^{k-1} x_i \neq 1 \Big\},$$

and define the map $\psi$ as follows: $\psi(x) = (\mathbf{p}, \{M_e\})$ where $p_i = x_i$ for
$i = 1, \dots, k - 1$, and the upper left $(k - 1) \times (k - 1)$ blocks of each $M_e$ are given by
successive entries in x. Use the conditions that $\sum_i p_i = 1$, $\sum_j M_e(i,j) = 1$, and
$\mathbf{p} M_e = \mathbf{p}$ to give rational formulas for the remaining entries of $\mathbf{p}$ and $M_e$ in
terms of x. Clearly $\psi$ is 1-1 and $X_{SBD,k,T_r} \subseteq \psi(X)$. Moreover, $\psi^{-1}(X_{SBD,k,T_r})$
is dense in X since it contains a Euclidean-open subset of the real points of
X, which is Zariski dense in X.

For the ATR model, with $d = \frac{k(k-1)}{2} + (k - 1)E$, let

$$X = \{ (Q, u) \mid Q \in O_k(\mathbb{C}),\ Q = (q_{ij}),\ q_{i1} \neq 0,\ u \in \mathbb{C}^{(k-1)E} \}.$$

Here $O_k(\mathbb{C})$ is the variety of complex orthogonal $k \times k$ matrices, which has
dimension $\frac{k(k-1)}{2}$. Define $\psi$ by: $\psi(Q, u) = (\mathbf{p}, \{M_e\})$ where
$\mathbf{p} = (q_{11}^2, q_{21}^2, \dots, q_{k1}^2)$, and, with $D = \mathrm{diag}(q_{11}, q_{21}, \dots, q_{k1})$,

$$M_{e_i} = D^{-1} Q\, \mathrm{diag}(1, u_{j+1}, u_{j+2}, \dots, u_{j+k-1})\, Q^T D,$$

where $j = (i - 1)(k - 1)$. That $\psi$ is generically finite is clear, and that
$X_{ATR,k,T_r} \subseteq \psi(X)$ follows from the discussion in section 2. Also, the set
$\psi^{-1}(X_{ATR,k,T_r})$ is dense in X since it contains a Euclidean-open subset of
the real points of X, which is Zariski dense in X.
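The construction of an ATR edge matrix from an orthogonal matrix and a list of eigenvalues can be tried out directly. The sketch below uses a hypothetical 3 x 3 orthogonal matrix with rational entries, so all arithmetic is exact, and hypothetical eigenvalues 1/2 and 1/4; the helper names are illustrative. It forms the base distribution from the squared first column, builds the edge matrix, and checks that rows sum to 1, that the base distribution is fixed, and that the joint distribution on a single edge is symmetric. The construction alone does not guarantee entries in [0,1], so that is checked separately for this example.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def diag(v):
    return [[v[i] if i == j else F(0) for j in range(len(v))]
            for i in range(len(v))]

def atr_matrix(Q, eigenvalues):
    """D^{-1} Q diag(1, eigenvalues...) Q^T D, with D = diag(first column of Q)."""
    d = [row[0] for row in Q]
    core = matmul(matmul(Q, diag([F(1)] + list(eigenvalues))), transpose(Q))
    return matmul(matmul(diag([1 / x for x in d]), core), diag(d))

# Hypothetical rational orthogonal matrix with a nowhere-zero first column.
Q = [[F(1, 3), F(2, 3), F(2, 3)],
     [F(2, 3), F(1, 3), F(-2, 3)],
     [F(2, 3), F(-2, 3), F(1, 3)]]

p = [row[0] ** 2 for row in Q]            # squared first column of Q
Me = atr_matrix(Q, [F(1, 2), F(1, 4)])    # hypothetical non-unit eigenvalues

row_sums = [sum(row) for row in Me]
stationary = matmul([p], Me)[0]
joint = matmul(diag(p), Me)               # joint distribution on a single edge
in_unit_interval = all(0 <= x <= 1 for row in Me for x in row)
print(p, row_sums, stationary == p, joint == transpose(joint), in_unit_interval)
```

The orthogonal matrix carries the shared eigenvectors for all edges, while each edge contributes only its own k - 1 eigenvalues, which matches the parameter count in the dimension formula.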

Finally, for the SD model, recall [Jac75] that a family of real commuting
normal matrices $A_i$ can be simultaneously expressed as $A_i = Q B_i Q^T$, with
$Q \in O_k(\mathbb{R})$, and the $B_i$ real block diagonal matrices with the same block
structure, where each diagonal block is either $1 \times 1$ or $2 \times 2$ of the form
$\begin{pmatrix} a & b \\ -b & a \end{pmatrix}$.
The block structures we need to consider will have n $1 \times 1$ blocks, the first of
which is 1, followed by m $2 \times 2$ blocks, where $n \ge 1$, $m \ge 0$, and $n + 2m = k$.
Proceeding similarly to the case of the ATR model, for each of these
$\lfloor (k+1)/2 \rfloor$ possible block structures B, we let $X_B$ denote a copy of X as defined
for ATR, and define a map $\psi_B : X_B \to \mathbb{C}^M$ similar to the ATR map, where the
entries in u give the independent block entries in $B_i$ and $M_{e_i} = D^{-1} Q B_i Q^T D$.
Letting X be the disjoint union of the $X_B$, we obtain a map $\psi : X \to \mathbb{C}^M$. The rest
of the argument is similar to that for the ATR model. □

The construction of the varieties X and maps $\psi$ in this proof also yields

Proposition 7 The varieties V(SBD, k, T) and V(ATR, k, T) are irreducible,
but V(SD, k, T) is the union of $\lfloor (k+1)/2 \rfloor$ distinct irreducible components, one of
which is V(ATR, k, T).

For each of the SBD, SD, and ATR k-base models on a tree T, we can construct
a few phylogenetic invariants, though we are far from a full understanding
of the ideals and varieties. Since any submodel inherits all invariants of a
supermodel, and SBD $\supset$ SD $\supset$ ATR, we consider the models in that order.
In addition, since these are all submodels of GM, all GM invariants on T, such
as those of [AR03,AR04], are also invariants of these models.

SBD model: We first consider the 3-taxon tree $T_r$, rooted at the central node,
and reason similarly to section 3. If $P = \phi(\mathbf{p}_r; M_1, M_2, M_3)$, then we have

$$p_{i++} = p_{+i+} = p_{++i},$$

giving 2(k - 1) independent linear invariants expressing equality of base
distributions at the leaves. We can also construct an invariant from the
hyperdeterminant Det(P) of $k \times k \times k$ tensors. Letting m = m(k) denote the degree
of this polynomial (so m(3) = 36 and m(4) = 272), then as before we find

$$\mathrm{Det}(P) = \left(\det(M_1) \det(M_2) \det(M_3) \det(D_r)\right)^{m/k}.$$

Similarly,

$$\det(p_{ij+}) = \det(M_1) \det(M_2) \det(D_r),$$

$$\det(p_{i+j}) = \det(M_1) \det(M_3) \det(D_r),$$

$$\det(p_{+ij}) = \det(M_2) \det(M_3) \det(D_r),$$

so

$$\left(\det(p_{ij+}) \det(p_{i+j}) \det(p_{+ij})\right)^{m/(2k)} - \Big(\prod_i p_{i++}\Big)^{m/(2k)} \mathrm{Det}(P)$$

is an invariant for the SBD model, since 2k divides m, as can be shown from
formulas in [GKZ94].

To see this is not an invariant for the GM model, we check that it does not
vanish for some GM parameters. Indeed, if the parameters are chosen so the
entries of $\mathbf{p}_r$ and the $p_{i++}$ are positive, the $M_i$ are non-singular, and
$\det(D_r) \neq \prod_i p_{i++}$, then the invariant will be non-zero.

A similar construction replacing Det with any relative invariant h of
$GL_k \times GL_k \times GL_k$ acting on $\mathbb{C}^k \otimes \mathbb{C}^k \otimes \mathbb{C}^k$ produces a phylogenetic
invariant, provided h does not vanish on all diagonal tensors. If h does vanish on
diagonal tensors, then h is already an invariant of the GM model.

To obtain n-taxon invariants we can of course compose 3-taxon invariants with 
any marginalization map of n-dimensional tensors to 3-dimensional ones. 

SD model: Note that for any choice of 2 taxa $a_j$, $a_l$ on T, if $P = \phi(\mathbf{p}_r, \{M_e\})$,
where $(\mathbf{p}_r, \{M_e\}) \in X_{SD,k,T_r}$, then the 2-dimensional marginalization $P_{jl}$ of
P obtained by summing over indices corresponding to all other taxa will be
of the form $D_r M$, where M is a product of matrices in the collection

$$\{M_e \mid e \in \mathrm{Edge}(T_r)\} \cup \{D_r^{-1} M_e^T D_r \mid e \in \mathrm{Edge}(T_r)\},$$

and $D_r = \mathrm{diag}(\mathbf{p}_r) = \mathrm{diag}(p_{1+\dots+}, p_{2+\dots+}, \dots, p_{k+\dots+})$. Thus all matrices in the
collection

$$\{D_r^{-1} P_{jl} \mid 1 \le j < l \le n\} \cup \{D_r^{-1} (P_{jl})^T \mid 1 \le j < l \le n\}$$

will commute. For each pair chosen from this set, we get a collection of
polynomials of degree k + 1 from the statement of commutativity: For instance,

$$D_r^{-1} P_{jl}\, D_r^{-1} P_{j'l'} = D_r^{-1} P_{j'l'}\, D_r^{-1} P_{jl}$$

gives invariants from the entries of

$$P_{jl} \left(\det(D_r)\, D_r^{-1}\right) P_{j'l'} - P_{j'l'} \left(\det(D_r)\, D_r^{-1}\right) P_{jl}.$$

That some of these are not invariants of the SBD model when k > 2 can be 
verified, most easily for a 2-taxon tree by a generic choice of SBD parameters. 
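The 2-taxon check mentioned above can be carried out by hand or with a few lines of code. The sketch below, with illustrative helper names and hypothetical 3-state parameters, forms the joint distribution on a single edge and computes the commutator of the two matrices obtained by rescaling it and its transpose by the inverse of the diagonal matrix of base frequencies. For simplicity it divides by the base frequencies directly rather than clearing denominators with the determinant, so its entries are rational functions rather than the polynomial invariants themselves, but they vanish at the same points. One example matrix is a circulant (an SD choice), the other is a doubly stochastic matrix that is not normal (an SBD choice that is not SD).

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def diag(v):
    return [[v[i] if i == j else F(0) for j in range(len(v))]
            for i in range(len(v))]

def sd_commutator(p, M):
    """Commutator of D^{-1} P and D^{-1} P^T for the 2-taxon joint P = D M."""
    D_inv = diag([1 / x for x in p])
    P = matmul(diag(p), M)
    A = matmul(D_inv, P)
    B = matmul(D_inv, transpose(P))
    AB, BA = matmul(A, B), matmul(B, A)
    n = len(p)
    return [[AB[i][j] - BA[i][j] for j in range(n)] for i in range(n)]

p = [F(1, 3)] * 3                      # hypothetical uniform base distribution

C = [[F(1, 2), F(1, 3), F(1, 6)],      # circulant: satisfies the SD conditions
     [F(1, 6), F(1, 2), F(1, 3)],
     [F(1, 3), F(1, 6), F(1, 2)]]

S = [[F(1, 2), F(1, 2), F(0)],         # doubly stochastic but not normal:
     [F(1, 4), F(1, 4), F(1, 2)],      # fixes p, yet fails commutation
     [F(1, 4), F(1, 4), F(1, 2)]]

zero = [[F(0)] * 3 for _ in range(3)]
sd_vanishes = sd_commutator(p, C) == zero
sbd_vanishes = sd_commutator(p, S) == zero
print(sd_vanishes, sbd_vanishes)
```

The second matrix fixes the uniform distribution, so it is a legitimate SBD parameter, yet the commutator is non-zero, which is the kind of witness the verification calls for.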

ATR model: We consider first a 2-taxon tree, with $P \in \phi(X_{ATR,k,T_r})$. Then
$P = D_r M_e$, so the condition $M_e = D_r^{-1} M_e^T D_r$ implies $P = P^T$. The entries of
this matrix equation then give linear invariants, which are not invariants of the
SD model for k > 2, since there exist parameters for the SD model with
$M_e \neq D_r^{-1} M_e^T D_r$. Composing these invariants with 2-dimensional marginalization
maps gives linear invariants for an n-taxon tree.